Statistical mechanics is used to study unrealizable generalization in two large feed-forward neural networks with binary weights and output, a perceptron and a tree committee machine. The student is trained by a teacher that is larger, i.e. has more units than the student. It is shown that this is equivalent to using training data corrupted by Gaussian noise. Each machine is considered in the high temperature limit and in the replica symmetric approximation as well as for one step of replica symmetry breaking. For the perceptron a phase transition is found for low noise. However, the transition is not to optimal learning. If the noise is increased the transition disappears. In both cases $\epsilon_g$ will approach optimal performance with a $(\ln\alpha/\alpha)^k$ decay for large $\alpha$. For the tree committee machine noise in the input layer is studied, as well as noise in the hidden layer. If there is no noise in the input layer there is, in the case of one step of replica symmetry breaking, a phase transition to optimal learning at some finite $\alpha$ for all levels of noise in the hidden layer. When noise is added to the input layer the generalization behavior is similar to that of the perceptron. For one step of replica symmetry breaking, in the realizable limit, the values of the spinodal points found in this paper disagree with previously reported estimates [1],[2]. Here the value $\alpha_{sp} = 2.79$ is found for the tree committee machine and $\alpha_{sp} = 1.67$ for the perceptron.

PACS: 87.10, 02.50, 05.20, 64.60C 



1. Introduction 

A feed-forward neural network can be used to estimate an unknown rule from random examples [3] by adaptation of its weights. Using methods from statistical mechanics of disordered systems [4] the performance of a student network trained on examples obtained from a teacher network of the same architecture has been studied (for a review see [5]). In this case the rule is said to be realizable since it is possible for the student to develop the same weights as the teacher.

One way to construct an unrealizable rule is to allow for a teacher that is larger (more units) than the student. This will be shown to be equivalent to adding Gaussian noise to the training set. The noisy data scenario has been investigated for networks with continuous weights [6], [7], [8]. In the limit where the teacher is infinitely larger than the student (large noise limit) the only thing the student can do is to learn each example by heart, and in this limit the problem reduces to that of storage capacity.

In this paper the generalization behavior of two different types of binary neural networks with binary weights is studied, a perceptron (section 2) and a tree committee machine (section 3), in the limit where the number of units is large. The rule is defined by a teacher of the same type but having more units than the student, making the task unrealizable.

The training of the student, having $N$ units, is based on $\alpha N$ examples obtained by picking inputs $\xi^\mu$ and assigning outputs $\tau^\mu$ as given by the teacher. With $\sigma^\mu$ being the $\mu$th output of the student, a training energy $E = \sum_\mu \Theta(-\sigma^\mu\tau^\mu)$ is defined, which leads to a probability density with Boltzmann weight $e^{-\beta E}$, where $\beta = 1/T$ is the inverse temperature. First the high temperature limit is considered for each type of network, the perceptron in section (2.1) and the tree committee machine in section (3.1). Then, in sections (2.2) and (3.2), the replica trick is used, assuming replica symmetry (RS), to study the average over all training sets of the free energy, $\beta f$. In sections (2.3) and (3.3) the corrections given by one step of replica symmetry breaking (RSB) are discussed. Since, in the noiseless limit, the value of the spinodal point found in the RSB-case disagrees with previously reported estimates [1],[2], some time is spent on the saddle point equations in appendix A. Finally, in appendix B, the procedure for finding the asymptotic generalization behavior for large $\alpha$ is given.



2. A Large Binary Perceptron. 

Let the student and the teacher have $N$ and $M$ input units respectively with $N < M$. Presented with an input $s$, the teacher evaluates $\tau(s) = \mathrm{sgn}(v \cdot s)$, while the student computes $\sigma(s_0) = \mathrm{sgn}(w_0 \cdot s_0)$, given the input $s_0$. Here $s$ and $v$ are elements of $\mathbb{R}^M$, while vectors having a zero subscript are elements of $\mathbb{R}^N$. When the student is presented the same input vector, $\xi$, as the teacher, it only considers the first $N$ components, $\xi_0$. Thus the target rule will be,

$$\tau(\xi) = \mathrm{sgn}\left(\sum_{j=1}^{M} v_j \xi_j\right) = \mathrm{sgn}\left(v_0 \cdot \xi_0 + \eta\right), \qquad (2.1)$$

where $v_0$ is constructed from the first $N$ components of $v$ and $\eta$ collects the contribution of the remaining components. Effectively this means that the student will be given the task $\tau'(\xi_0) = \mathrm{sgn}(v_0 \cdot \xi_0)$ with noise on the input vector $\xi_0$ and/or on the weight vector $v_0$. Since $\eta$ is constructed from independent Gaussian random variables, $v_j$ and $\xi_j$ ($j = N+1, \ldots, M$) with unit variance, $\eta$ will also be Gaussian with variance,

where $\gamma = \sqrt{N/M}$. $\gamma$ has the simple interpretation of the relative size of the student to the teacher. If $\gamma = 1$ the student and the teacher are of the same size, i.e. there is no noise. If $\gamma = 0$ the teacher is infinitely larger than the student, i.e. the data will be completely noisy.
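As a rough illustration of this setup, the sketch below builds a binary teacher with $M$ inputs and a student that only sees the first $N$ components, and counts disagreements on a random training set (the training energy of the introduction). The sizes, the seed and the helper names `sign` and `training_energy` are hypothetical choices made for this example only; they are not taken from the paper.

```python
import math
import random


def sign(x):
    """Sign function with the convention sgn(0) = +1."""
    return 1 if x >= 0 else -1


def training_energy(w0, v, inputs):
    """Number of examples on which the student disagrees with the teacher.

    The teacher sees all M components of each input, the student only the
    first N = len(w0) components.
    """
    n = len(w0)
    errors = 0
    for xi in inputs:
        tau = sign(sum(vj * xj for vj, xj in zip(v, xi)))          # teacher output
        sigma = sign(sum(wj * xj for wj, xj in zip(w0, xi[:n])))   # student output
        if sigma != tau:
            errors += 1
    return errors


rng = random.Random(0)          # fixed seed, illustrative only
M, N, P = 60, 40, 200           # hypothetical sizes: teacher, student, examples
v = [rng.choice((-1, 1)) for _ in range(M)]                        # binary teacher
inputs = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(P)]

gamma = math.sqrt(N / M)                         # relative size of the student
E_truncated = training_energy(v[:N], v, inputs)  # student copies v_0: noise remains
E_realizable = training_energy(v, v, inputs)     # N = M: no noise, no errors
```

Even a student that copies the first $N$ teacher weights exactly still makes training errors when $N < M$, which is the sense in which the larger teacher acts as a noise source.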

The generalization error, $\epsilon_g$, obtained by taking the average of $\Theta(-\sigma\tau)$ over normally distributed inputs, $s_j$ ($j = 1, \ldots, M$), is

$$\epsilon_g = \frac{1}{\pi}\arccos(\gamma R), \qquad (2.3)$$

where $R$ is the overlap between $w_0$ and $v_0$. For $R = 1$ we obtain the optimal value, $\epsilon_{opt}$, of $\epsilon_g$.
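The expression (2.3) can be checked numerically. The sketch below assumes that the normalized student and teacher fields are jointly Gaussian with correlation $\gamma R$, which is the identity behind an arccos formula of this kind; the function names and the sample size are illustrative choices, not part of the paper.

```python
import math
import random


def eps_g(R, gamma):
    """Generalization error of eq. (2.3): arccos(gamma * R) / pi."""
    return math.acos(gamma * R) / math.pi


def eps_g_monte_carlo(R, gamma, samples=100_000, seed=1):
    """Estimate the disagreement probability of two unit Gaussian fields.

    Assumption: the student field x and the teacher field y have
    correlation rho = gamma * R.
    """
    rng = random.Random(seed)
    rho = gamma * R
    c = math.sqrt(1.0 - rho * rho)
    wrong = 0
    for _ in range(samples):
        x = rng.gauss(0.0, 1.0)              # student field
        y = rho * x + c * rng.gauss(0.0, 1.0)  # teacher field, correlated with x
        if x * y < 0:                        # outputs have opposite sign
            wrong += 1
    return wrong / samples
```

For instance, `eps_g(1.0, gamma)` gives the optimal value $\epsilon_{opt}$ for a given $\gamma$, and it vanishes only in the realizable case $\gamma = 1$.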

First the high temperature limit is considered. 
Then by using the replica method, the RS approx- 
imation is studied, and finally the corrections given 
by one-step RSB are discussed. 



2.1. High Temperature Limit 

In previous work [1] the high temperature limit has proven to be interesting since it is both computationally easy and gives the general behavior of learning. It is defined so that both $\alpha$ and $T$ approach infinity while $\alpha\beta$ remains constant. The free energy is simply

$$\beta f = \frac{1-R}{2}\ln\frac{1-R}{2} + \frac{1+R}{2}\ln\frac{1+R}{2} + \frac{\alpha\beta}{\pi}\arccos(\gamma R). \qquad (2.1.1)$$
The qualitative behavior of the learning curves can be divided into two types depending on whether the noise level is above or below a particular value $\gamma_0$. For $\gamma_0 < \gamma < 1$ there is, as in the realizable case, a range $(\alpha\beta)_{sp1} < \alpha\beta < (\alpha\beta)_{sp2}$ for which $\beta f$ has two minima. In between $(\alpha\beta)_{sp1}$ and $(\alpha\beta)_{sp2}$ there is a transition point $(\alpha\beta)_{tr}$ at which the global properties of the minima change. In contrast to the noiseless case, $(\alpha\beta)_{sp1} > 0$ and thus for $0 < \alpha\beta < (\alpha\beta)_{sp1}$ there is only one minimum. The minimum persisting also for $\alpha\beta > (\alpha\beta)_{sp2}$ is close to $R = 1$ and approaches optimal performance as $\alpha\beta$ increases. Note that in contrast to the realizable case there is no solution at $R = 1$. Typically $(\alpha\beta)_{tr}$, $(\alpha\beta)_{sp1}$ and $(\alpha\beta)_{sp2}$ increase with decreasing $\gamma$ and merge at $\gamma = \gamma_0$. This is illustrated in figure (1).

Figure 1: The picture shows learning curves for the perceptron in the high temperature limit at three different noise levels. At $\gamma = 0.999$ there are two states, a stable (solid line) and a metastable (dotted) one, resulting in a phase transition. At $\gamma = 0.965$ (dashed) the spinodal points and the transition point merge and finally at $\gamma = 0.9$ (dashed-dotted) there is no phase transition.

The two minima of $\beta f$ must be separated by a maximum, implying that $\partial^2(\beta f)/\partial R^2 = 0$ at the spinodal points. Using the saddle point equation,

$$\frac{\alpha\beta\gamma}{\pi\sqrt{1-\gamma^2R^2}} = \frac{1}{2}\ln\frac{1+R}{1-R} = \mathrm{arctanh}(R), \qquad (2.1.2)$$

to eliminate $\alpha\beta$ from $\partial^2(\beta f)/\partial R^2 = 0$, gives

$$R = \tanh\left[\frac{1-\gamma^2R^2}{\gamma^2R(1-R^2)}\right] \equiv g(R,\gamma). \qquad (2.1.3)$$

For $\gamma = 1$, (2.1.3) has one solution, $R_{sp} = 0.83$, resulting in $(\alpha\beta)_{sp} = 2.08$ in agreement with [1]. In the region $\gamma_0 < \gamma < 1$ (2.1.3) has two solutions giving $(\alpha\beta)_{sp1}$ and $(\alpha\beta)_{sp2}$. At $\gamma = \gamma_0$ the two solutions merge and the two curves $R$ and $g(R,\gamma)$ are tangent to each other. Thus $\gamma_0$ can be found by solving,

$$R = g(R,\gamma_0), \qquad (2.1.4)$$
$$1 = \frac{\partial g(R,\gamma_0)}{\partial R}, \qquad (2.1.5)$$

giving $\gamma_0 = 0.965$. For $\gamma < \gamma_0$, $\beta f$ has only one minimum (for all $\alpha\beta$) which moves towards $R = 1$ as $\alpha\beta$ approaches infinity. Note that fairly small amounts of noise will change the qualitative behavior from phase transition to no transition.
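The numbers quoted in this section can be reproduced with a few lines of code. The sketch below works with the function $\mathrm{arctanh}(R) - (1-\gamma^2R^2)/(\gamma^2R(1-R^2))$, which vanishes exactly where the spinodal condition (2.1.3) holds; the helper names `h`, `bisect` and `max_h`, the grid size and the bracketing intervals are assumptions made for this example. The merging of the two spinodal solutions at $\gamma_0$ is located as the noise level at which the maximum of `h` over $R$ crosses zero.

```python
import math


def h(R, gamma):
    """Zero of h is a solution of the spinodal condition (2.1.3)."""
    g2 = gamma * gamma
    return math.atanh(R) - (1.0 - g2 * R * R) / (g2 * R * (1.0 - R * R))


def bisect(f, lo, hi, tol=1e-12):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def alpha_beta(R, gamma):
    """Saddle point equation (2.1.2) solved for the product alpha * beta."""
    return math.pi * math.sqrt(1.0 - (gamma * R) ** 2) * math.atanh(R) / gamma


def max_h(gamma, n=4000):
    """Largest value of h on a uniform grid of R in (0, 1)."""
    return max(h(i / n, gamma) for i in range(1, n))


# Realizable case gamma = 1: a single spinodal point.
R_sp = bisect(lambda R: h(R, 1.0), 0.5, 0.99)
ab_sp = alpha_beta(R_sp, 1.0)

# Noise level at which the two spinodal solutions merge (max of h crosses zero).
gamma_0 = bisect(max_h, 0.9, 0.999, tol=1e-6)
```

With these definitions `R_sp` and `ab_sp` should land close to the values 0.83 and 2.08 quoted above, and `gamma_0` close to 0.965, up to the resolution of the grid.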

In weight space this behavior can be understood as follows. In the noiseless case there are, for small $\alpha$, two regions in weight space corresponding to the minima of $\beta f$, one with poor generalization and one with good. If $\alpha$ is small enough the "poor" region has the lowest free energy. As $\alpha$ increases the "poor" region moves towards the "good" and for $\alpha > \alpha_{tr}$ the "good" region has the lowest free energy. Since for $\alpha = \alpha_{tr}$ the "poor" and "good" regions are separated, there will be a phase transition.

If noise is added, the sizes of these regions will increase. For low $\alpha$ there is only one region in weight space corresponding to a minimum of $\beta f$. It will have poor generalization. At $\alpha = \alpha_{sp1}$ another region corresponding to a free energy minimum appears. This region gives better generalization. Again as $\alpha$ increases the "poor" region moves towards the "good" and for $\alpha > \alpha_{tr}$ the "good" region has the lowest free energy. Since for $\alpha = \alpha_{tr}$ the "poor" and "good" regions are separated there will be a phase transition. If the noise is increased the "poor" region is so large that when the "good" region is created it will overlap with the "poor". Thus there is only one region, moving towards better generalization, and there is no phase transition.

2.2. Replica Symmetric Theory 

Using the same methods as in [1] the RS approximation to the free energy is obtained,

$$\beta f_{RS} = \mathop{\mathrm{extr}}_{R,\hat R,q,\hat q}\left[G_r(R,q,\alpha,\gamma,\beta) + G_s(R,\hat R,q,\hat q)\right],$$
$$G_r = -2\alpha\int Du\, H\left(\frac{\gamma R u}{\sqrt{q-\gamma^2R^2}}\right) V\left(\frac{\sqrt{q}\,u}{\sqrt{1-q}}\right),$$
$$G_s = \frac{1}{2}(1-q)\hat q + R\hat R - \int Du\,\ln\left[2\cosh\left(\hat R + \sqrt{\hat q}\,u\right)\right],$$
$$V(x) = \ln\left[e^{-\beta} + (1-e^{-\beta})H(x)\right]. \qquad (2.2.1)$$

The saddle point equations generated by the extremal condition in (2.2.1) are given in appendix A. Here $q$ is the typical overlap between two different $w_0$. $R$, $\gamma$ and $\alpha$ have the interpretation given above. Using the saddle point equations we can, given $\gamma$ and $\beta$, eliminate the auxiliary variables $\hat R$ and $\hat q$, and find the dependence of $R$ (and $\epsilon_g$) on $\alpha$.

Figure 2: Critical capacity for the perceptron in the RS-approximation. The upper curve shows $\alpha_c$ as a function of noise and the lower shows $R_c$.
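The building blocks of the RS free energy are easy to evaluate numerically. The sketch below assumes the usual definitions $H(x) = \int_x^\infty Dt$ with Gaussian measure $Dt$ (so that $H$ is a complementary error function) and $V(x) = \ln[e^{-\beta} + (1-e^{-\beta})H(x)]$; the quadrature routine `gauss_average`, its cutoff and its grid size are illustrative choices and not the method used in the paper.

```python
import math


def H(x):
    """Gaussian tail integral H(x) = int_x^inf Dt = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def V(x, beta):
    """V(x) = ln[exp(-beta) + (1 - exp(-beta)) H(x)] (signs as assumed above)."""
    return math.log(math.exp(-beta) + (1.0 - math.exp(-beta)) * H(x))


def gauss_average(f, n=2001, cutoff=8.0):
    """Trapezoid-rule estimate of the Gaussian average int Du f(u)."""
    du = 2.0 * cutoff / (n - 1)
    total = 0.0
    for i in range(n):
        u = -cutoff + i * du
        weight = 0.5 if i in (0, n - 1) else 1.0   # trapezoid end-point weights
        total += weight * f(u) * math.exp(-0.5 * u * u)
    return total * du / math.sqrt(2.0 * math.pi)
```

A term such as $G_r$ can then be written as a single call to `gauss_average` with the product of `H` and `V` as integrand, once $R$, $q$, $\gamma$ and $\beta$ are fixed.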

First consider the zero temperature case. This corresponds to only allowing students that answer all questions correctly. If $\gamma < 1$ the training data is noisy and there is a maximum size of the training set $\alpha_c N$ beyond which no student can perform optimally. $\alpha_c(\gamma)$ and $R_c(\gamma)$ are plotted in figure (2). For $\gamma = 0$ the known result of Gardner [9] is reproduced. Note that the curves do not give $\alpha_c \to \infty$ as $\gamma \to 1$. However this may not be expected since the curves only give correct predictions for states that are stationary points of $\beta f_{RS}$ and in the realizable case the state $R = 1$ is not stationary as was shown in [1]. For $\gamma = 1$ both the transition and the spinodal points agree with the values found in [1]. A learning curve for $\gamma = 0.99$ is shown in figure (3).

At $T > 0$ the learning behavior is the same as for the high temperature limit but with a different $\gamma_0$, depending on $T$, and with $\epsilon_g$ and $q$ having the asymptotic form,



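$$\epsilon_g - \epsilon_{opt} \simeq C_1(\gamma,\beta)\,\frac{\ln\alpha}{\alpha}, \qquad (2.2.2)$$
$$1 - q \simeq C_2(\gamma,\beta)\left(\frac{\ln\alpha}{\alpha}\right)^2, \qquad (2.2.3)$$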



for large $\alpha$. For details on how to compute the asymptotic form see appendix B. For some range of $\gamma$, $\gamma_A < \gamma < 1$, there is a phase transition already at zero temperature while in a range $\gamma_B < \gamma < \gamma_A$ there is no transition at low temperature. As the temperature is increased a transition develops which is illustrated in figure (3) for $\gamma = 0.99$. Finally when $\gamma < \gamma_B$ there seems to be no phase transition no matter how high the temperature.

Figure 3: For the RS-estimate of the perceptron the figure shows learning curves at the indicated temperatures ($T$ = 0, 0.3, 0.5, 1) for $\gamma = 0.99$. Only the stable state is plotted at $T = 0.5$ and $T = 1$. Note that for $T = 0$ there is a critical $\alpha$ above which the saddle point equations do not have a solution.

2.3. Replica Symmetry Breaking 

In the RS approximation the entropy will always turn negative at some finite $\alpha$ and therefore a region in $\alpha$-$T$ space for which the system exhibits replica symmetry breaking (RSB) is expected, see figure (4). Analogous to [1] one step of RSB gives,
$$\cdots\ \frac{1}{2}\left((m-1)q_1 + 1\right)\hat q_1 - \frac{m}{2}\,q_0\hat q_0 + R\hat R\ \cdots$$



Dy (2 cosh(_R + \fq~o~t + \/ qi - q yj ) 
t\/qa ~ J 2 R 2 ~ VjR + u^Jqi - q 



V 1 - Si 

(2.3.1)

where the extremum is taken over $R$, $\hat R$, $q_0$, $\hat q_0$, $q_1$, $\hat q_1$, and $m$. As in [1] the limit $q_1 \to 1$, $\hat q_1 \to \infty$ is considered, implying that the stationary points of $f_{RSB}$ are given by the stationary points of $f_{RS}$ having zero entropy (see appendix A for details).

Figure 4: Phase diagram for the perceptron at $\gamma = 0.999$. Below the solid line the stable state exhibits RSB and below the gray line the metastable state exhibits RSB. To the left of the first spinodal line there is only one state in the interior. Below the second spinodal line there is only the state close to $R = 1$. In between the two states coexist. Their global properties change as the transition line is crossed.

The learning behavior is analogous to the high temperature limit but with $\gamma_0 = 0.995$. In appendix B the asymptotic form of $\epsilon_g$ and $q$ is computed,



In a 

a 



l-« = C 4 (7)[ l — 

a 



(2.3.2) 
(2.3.3) 



When $0.999 < \gamma < 1$, $\alpha_{tr}$ occurs in between $\alpha_{sp1}$ and $\alpha_{sp2}$ while for $0.996 < \gamma < 0.998$, $\alpha_{tr} = \alpha_{sp1}$, i.e. the state with better generalization is stable as soon as it appears.

For the case $\gamma = 0.05$ the critical capacity $\alpha_c = 0.83$ ($q_c = 0.56$) is found, which is compatible with the known results for $\gamma = 0$ [10]. Some values are given in table (1) and some typical learning curves are given in figure (5).

In the noiseless limit the result $\alpha_{sp} = 1.67$ is found, correcting a previous result by Seung et al. [1] ($\alpha_{sp} = 1.63$). The reason for this is given in appendix A.

It is also interesting to compare with some recently reported upper bounds for the Ising perceptron [11]. In this article the asymptotical behavior was found to be the same as (2.3.2). The authors found that the phase transition disappeared below $\gamma = 0.998$, thus not only predicting the correct qualitative behavior but also giving a tight quantitative bound on $\gamma_0$. Also, at $\gamma = 0.998$, they found $\alpha_{tr} = 2.6136$ whereas $\alpha_{tr} = 1.83$ is obtained at $\gamma_0$ given above and using the replica method.

Figure 5: Learning curves ($\epsilon_g$ vs $\alpha$) for the perceptron at the indicated noise levels. The solid line is the stable state, the dotted the metastable state, and the lower line indicates optimal performance.

Table 1: For one step of RSB for the perceptron the table shows some values of $\alpha_{tr}$, $\alpha_{sp1}$ and $\alpha_{sp2}$ for different $\gamma$.



3. A Binary Tree Committee Machine. 

Let the student and the teacher have $N$ ($K$) and $M$ ($L$) input (hidden) units respectively, with $N < M$ and $K < L$. We can think of the student (teacher) as a committee of binary perceptrons each of which has $N/K$ ($M/L$) input units. As the $l$th perceptron in the teacher is presented an input $s_l$ the teacher evaluates,

$$\tau(s_1, \ldots, s_L) = \mathrm{sgn}\left(\sum_{l=1}^{L}\mathrm{sgn}\left(\sum_{m=1}^{M/L} v_{lm}s_{lm}\right)\right), \qquad (3.1)$$

while the student computes,

$$\sigma(s_1^{(0)}, \ldots, s_K^{(0)}) = \mathrm{sgn}\left(\sum_{k=1}^{K}\mathrm{sgn}\left(\sum_{n=1}^{N/K} w_{kn}^{(0)}s_{kn}^{(0)}\right)\right), \qquad (3.2)$$

as the $k$th perceptron in the student is given the input $s_k^{(0)}$. Here $s_l$ and $v_l$ are elements of $\mathbb{R}^{M/L}$ whereas a zero superscript indicates that the vector is an element of $\mathbb{R}^{N/K}$. When the student is presented the same set of input vectors, $\xi_l$ ($l = 1, \ldots, L$), as the teacher it only considers the first $N/K$ components of the first $K$ vectors in that set, $\xi_l^{(0)}$ ($l = 1, \ldots, K$).
Analogous to the simple perceptron we find that this is equivalent to learning a noisy target rule,

(3.3)

where $\eta$ and $\eta_k$ are independent Gaussian random variables with variance,

$$\langle\eta^2\rangle = \frac{L-K}{K}, \qquad (3.4)$$
$$\langle\eta_k^2\rangle = \frac{KM-NL}{NL}. \qquad (3.5)$$

$\gamma = \sqrt{K/L}$ is simply the relative number of hidden units of the student to the teacher while $\delta = \sqrt{NL/(KM)}$ is the relative number of input units of a perceptron in the student committee to a perceptron in the teacher committee. Thus $\gamma$ quantifies the noise in the hidden layer and $\delta$ the noise in the input layer. If $\gamma = \delta = 1$ the realizable case is recovered.

Using these parameters the generalization error is found,

$$\epsilon_g = \frac{1}{\pi}\arccos\left[R_e\right], \qquad (3.6)$$

where the effective order parameter is given by $R_e = \frac{2}{\pi}\gamma\arcsin(\delta R)$ and $R$ is the typical overlap between $w_k^{(0)}$ and $v_k^{(0)}$. Here, analogous to Schwarze and Hertz [2], it is assumed that $R$ is independent of the hidden unit index $k$.
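The roles of the two noise parameters in (3.6) can be made concrete with a small sketch. It assumes the effective overlap $R_e = \frac{2}{\pi}\gamma\arcsin(\delta R)$ as read from the text; the function names are hypothetical, and the clamp only guards against floating-point round-off at $R_e = 1$.

```python
import math


def effective_overlap(R, gamma, delta):
    """Effective order parameter R_e = (2 / pi) * gamma * arcsin(delta * R)."""
    r_e = (2.0 / math.pi) * gamma * math.asin(delta * R)
    return max(-1.0, min(1.0, r_e))   # guard against round-off just above 1


def eps_g_tree(R, gamma, delta):
    """Generalization error of eq. (3.6) for the tree committee machine."""
    return math.acos(effective_overlap(R, gamma, delta)) / math.pi


def eps_opt_tree(gamma, delta):
    """Optimal value, reached when the overlap is R = 1."""
    return eps_g_tree(1.0, gamma, delta)
```

For example, `eps_opt_tree(1.0, 1.0)` vanishes (realizable case), while `eps_opt_tree(0.9, 1.0)` and `eps_opt_tree(1.0, 0.9)` show how hidden-layer and input-layer noise each leave a residual error.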

As for the perceptron case the high temperature 
limit is considered first. Then by using the replica 
method, the RS approximation is studied and finally 
the corrections given by one step of RSB are dis- 
cussed. 

3.1. High Temperature Limit. 

Taking the limits $T \to \infty$ and $\alpha \to \infty$ while keeping $\alpha\beta$ fixed the free energy is found,

$$\beta f = \frac{1-R}{2}\ln\frac{1-R}{2} + \frac{1+R}{2}\ln\frac{1+R}{2} + \frac{\alpha\beta}{\pi}\arccos(R_e). \qquad (3.1.1)$$

If the noise level is low enough, there exist two spinodal points, $\alpha_{sp1}$ and $\alpha_{sp2}$, with a phase transition in between. In contrast to the perceptron one finds that if there is no input noise ($\delta = 1$) there is a phase transition to optimal performance at a finite $\alpha$ for all values of $\gamma$. Given a $\gamma$ and that $\delta_0(\gamma) < \delta$ a transition to a state approaching optimal learning in the large $\alpha$ limit is found. For $\delta < \delta_0(\gamma)$ the transition vanishes and $\epsilon_g$ approaches $\epsilon_{opt}$ as $\alpha$ tends to infinity. Especially if $\delta > \delta_a = \delta_0(0)$ there is always a phase transition while for $\delta < \delta_b = \delta_0(1)$ there is no phase transition independent of the hidden noise. By the same procedure as in section (2.1) one finds $\delta_a = 0.965$, $\delta_b = 0.924$ and $\delta_0(\gamma)$ as shown in figure (6). Also here $\alpha_{sp1}$, $\alpha_{tr}$ and $\alpha_{sp2}$ increase with increasing noise.

Figure 6: In the high temperature limit for the tree committee machine the system undergoes a phase transition at some $\alpha_{tr}$ for noise levels above the curve. Below it the system will smoothly approach $\epsilon_{opt}$ as $\alpha \to \infty$.

3.2. Replica Symmetric Theory. 

Analogous to Schwarze and Hertz [2] the RS estimate to the free energy is found,

$$\beta f_{RS} = \mathop{\mathrm{extr}}_{R,\hat R,q,\hat q}\left[G_r(R,q,\alpha,\gamma,\beta) + G_s(R,\hat R,q,\hat q)\right],$$
$$G_r = -2\alpha\int Du\, H\left(\frac{R_e u}{\sqrt{q_e - R_e^2}}\right) V\left(\frac{\sqrt{q_e}\,u}{\sqrt{1-q_e}}\right),$$
$$G_s = \frac{1}{2}(1-q)\hat q + R\hat R - \int Du\,\ln\left[2\cosh\left(\hat R + \sqrt{\hat q}\,u\right)\right],$$
$$V(x) = \ln\left[e^{-\beta} + (1-e^{-\beta})H(x)\right], \qquad (3.2.1)$$

where $R_e$ is given above and $q_e = \frac{2}{\pi}\arcsin q$. The value of $q$ is the typical correlation between two $w_k^{(0)}$ which are assumed to be independent of the perceptron index $k$. The interpretations of $R$, $\gamma$, $\delta$ and $\alpha$ are as given above. By using the equations generated by the extremal condition in (3.2.1) to eliminate $\hat R$ and $\hat q$ we can find the dependence of $R$ (and $\epsilon_g$) on $\alpha$ given $\gamma$, $\delta$ and $\beta$.

For $T = 0$ one should, as for the perceptron, find a critical capacity, $\alpha_c$, beyond which the student can not perform optimally on the training set. However, this is not the case, implying that the RS-approximation is bad. In the realizable case the values of both the transition and the spinodal point agree with [2].



At $T > 0$ the behavior is much the same as in the high temperature limit with the exception that for $\delta = 1$ the transition is not to an optimal state but to a state approaching optimal learning as $\alpha$ tends to infinity. The asymptotic form of $\epsilon_g$ and $q$ for large $\alpha$ can be found for $\delta = 1$, $\gamma < 1$,



tg t pt 



l-q 



/In a 
and for $\delta < 1$, $\gamma < 1$,

(3.2.2)
(3.2.3)

(3.2.4)
(3.2.5)

The asymptotic behavior can be found by the same method used in appendix B. As for the perceptron there is a range of noise levels for which there is no phase transition at low temperature but one is developed as the temperature is increased. One such example ($\gamma = 1$, $\delta = 0.99$) is used in the phase diagram in figure (7). If the noise is increased above some value there seems to be no phase transition no matter how high the temperature.

3.3. Replica Symmetry Breaking.

As was said in the previous section the RS-approximation fails in predicting a critical capacity. Also, the entropy will turn negative at some finite $\alpha$ and thus RSB is expected. In figure (7) a phase diagram for $\gamma = 1$, $\delta = 0.99$ shows the RSB region. For one step of RSB, in the limit $q_1 \to 1$, $\hat q_1 \to \infty$, the free energy is,

$$\cdots + G_s(R,\hat R,q_0,\hat q_0,m),$$
$$\cdots \int Du\,\ln\left[2\cosh\left(m\hat R + m\sqrt{\hat q_0}\,u\right)\right],$$
$$\cdots \ln\left[e^{-\beta m} + (1-e^{-\beta m})H(x)\right], \qquad (3.3.1)$$

Figure 7: Phase diagram for the tree committee machine at $\delta = 0.99$, $\gamma = 1.0$. Below the solid line the stable state exhibits RSB and below the gray line the metastable state exhibits RSB. To the left of the first spinodal line there is only one state in the interior. Below the second spinodal line there is only the state close to $R = 1$. In between the two states coexist. Their global properties change as the transition line is crossed.

Figure 8: In the RSB-case of the tree committee machine the system undergoes a phase transition at some $\alpha_{tr}$ at noise levels above the curve. Below it the system will smoothly approach $\epsilon_{opt}$ as $\alpha \to \infty$.



For reasons analogous to those given in [1] for the perceptron the stationary points of $f_{RSB}$ are given by the stationary points of $f_{RS}$ having zero entropy.

In contrast to the RS-case, but analogous to the high temperature limit, a transition to optimal learning is found for all $\gamma$ if $\delta = 1$. Using the same notation as in section (3.1) the values of $\delta_a$ and $\delta_b$ are 0.9995 and 0.9847 respectively, and $\delta_0(\gamma)$ is given in figure (8).

For the case $\gamma = \delta = 0.05$ the critical capacity $\alpha_c = 0.95$ ($q_c = 0.31$) is found, which is compatible with known results for $\gamma = \delta = 0$ [12]. Typically $\alpha_{sp1}$, $\alpha_{tr}$ and $\alpha_{sp2}$ increase with increasing unrealizability until $\delta = \delta_0(\gamma)$ where $\alpha_{sp1} = \alpha_{tr} = \alpha_{sp2}$. Some values of $\alpha_{sp1}$, $\alpha_{tr}$ and $\alpha_{sp2}$ are given in table (2), and some typical learning curves are given in figures (9) and (10).

As $\alpha \to \infty$ the asymptotic forms of $\epsilon_g$ and $q$ are,

(3.3.2)
(3.3.3)

Appendix B gives details of how to compute the asymptotic behavior, using the perceptron as an example. In the realizable limit ($\delta = \gamma = 1$) the result $\alpha_{sp} = 2.79$ is found, correcting the value found in [2] ($\alpha_{sp} = 2.58$). The reason for the correction is given in appendix A, where the perceptron is used as an example.

Figure 9: Learning curves ($\epsilon_g$ vs $\alpha$) for the tree committee machine at $\delta = 1$ and at the indicated $\gamma$. The solid line is the stable state, the dotted the metastable state, and the lowest line in each graph indicates optimal performance.

Figure 10: Learning curves ($\epsilon_g$ vs $\alpha$) for the tree committee machine at $\gamma = 1$ and at the indicated $\delta$. The solid line is the stable state, the dotted the metastable state, and the dashed-dotted line indicates optimal performance.
1.0000
1.0000
0.00
0.9900
0.00
0.9000
0.00
3.20
0.00
1.07
1.84
4.17




          0.5000   20.35   20.35   24.10
0.9990    1.0000    0.83    1.12    2.81
          0.9900    1.03    1.14    2.93
          0.9000    2.96    2.96    4.21
0.9900    1.0000    2.82    2.82    3.10
          0.9900    3.03    3.03    3.23



Table 2: For one step of RSB for the tree committee machine the table shows some values of $\alpha_{tr}$, $\alpha_{sp1}$ and $\alpha_{sp2}$ for different $\gamma$ and $\delta$.



4. Summary. 

In summary we have studied unrealizable learning in two large feed-forward neural networks, a perceptron and a tree committee machine, within the replica symmetric ansatz as well as for one step of replica symmetry breaking. The average generalization error has been calculated as a function of the load parameter $\alpha$.

For the perceptron it was shown that using a noisy training set results in a generalization error approaching optimal learning with increasing $\alpha$ according to a power law of $(\ln\alpha/\alpha)^k$ with $k = 2$ in the RSB-case. If the noise is low enough there is a phase transition at some finite $\alpha$ to a state which is close to $R = 1$. Increasing the noise makes the transition go away.

For the tree committee machine a similar generalization behavior was found, the main difference being that there is always a transition to optimal learning at some finite $\alpha$ if there is no noise in the input layer. Typically, noise in the input layer gives worse generalization behavior than noise in the hidden layer. For one step of RSB and with noise in the input layer as well as in the hidden layer the asymptotic form of $\epsilon_g$ was found to be $(\ln\alpha/\alpha)^k$ with $k = 2/3$.

In the realizable cases the values of $\alpha_{sp}$ correct previously reported results [1],[2] for the RSB spinodal point in the two machines. Here $\alpha_{sp} = 1.67$ was found for the perceptron and $\alpha_{sp} = 2.79$ for the tree committee machine.

I thank J. Hertz for his valuable advice and direc- 
tion and R. Urbanczik for many useful discussions. 
Also, I would like to thank H. Schwarze for sharing 
the code written in connection to ref. [2] which made 
it possible to sort out the disagreement on the spin- 
odal points. 



A. The Saddle Point Equations 

In the limit $q_1 \to 1$, $\hat q_1 \to \infty$ the one step RSB free energy (2.3.1) of the perceptron is related to the RS-estimate (2.2.1) through [1],

fRSB(R,R,qo,qo,m,l3) 

= —fRs(R,mR,q ,m 2 q ,(]m). (A.l) 
Stationarity with respect to $R$, $\hat R$, $q_0$ and $\hat q_0$ results in the relations $q_0(T_{RSB}, m, \alpha) = q_{RS}(T_{RSB}/m, \alpha)$ and $R(T_{RSB}, m, \alpha) = R_{RS}(T_{RSB}/m, \alpha)$ while stationarity with respect to $m$ gives $s_{RS}(T_{RSB}/m, \alpha) = 0$ where $s_{RS}$ is the RS entropy. Thus one can find the stationary points of $f_{RSB}$ by finding stationary points of $f_{RS}$ at a temperature $T_{RS} = T_{RSB}/m$ for which the entropy is zero. The saddle point equations generated by (2.2.1) are



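$$q = \int Dt\,\tanh^2\left(\hat R + \sqrt{\hat q}\,t\right), \qquad (A.2)$$
$$R = \int Dt\,\tanh\left(\hat R + \sqrt{\hat q}\,t\right), \qquad (A.3)$$
$$\hat q = \frac{\alpha}{\pi(1-q)}\int Du\, H\left(\frac{\gamma R u}{\sqrt{q-\gamma^2R^2}}\right)\left[\frac{e^{-v^2/2}}{u_\beta + H(v)}\right]^2, \qquad (A.4)$$
$$\hat R = \frac{\alpha\gamma}{\pi\sqrt{1-q}}\int Dt\,\frac{e^{-y^2/2}}{u_\beta + H(y)}, \qquad (A.5)$$

with $u_\beta = 1/(e^\beta - 1)$, $v = u\sqrt{q/(1-q)}$ and $y = t\sqrt{(q-\gamma^2R^2)/(1-q)}$. Using (A.2) and (A.3) to eliminate $\hat q$ and $\hat R$ in (A.4) and (A.5) gives a system of two non-linear equations

$$\hat q = \alpha\,h(R,q), \qquad (A.6)$$
$$\hat R = \alpha\,g(R,q). \qquad (A.7)$$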


At this point we could try to solve for $q$ and $R$ given $\alpha$. However, since $\epsilon_g$ is a many-valued function of $\alpha$ it is more economical to eliminate $\alpha$. This will give the equation,

$$\hat q\,g(R,q) = \hat R\,h(R,q), \qquad (A.8)$$

which can be solved for $q$ given $R$. $\alpha$ can then be evaluated using (A.6) or (A.7). The advantage is that $\epsilon_g$ is a single valued function of $R$. In the RSB-case this will be helpful since more than one solution for each $\alpha$ has to be considered as we show below.
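The benefit of using $R$ rather than $\alpha$ as the independent variable can be illustrated in the simpler high temperature limit, where the saddle point equation (2.1.2) gives $\alpha\beta$ explicitly as a function of $R$. The sketch below is only an analogy to the RS procedure described here (the functions $g$ and $h$ of (A.6) and (A.7) are not reproduced); the function names and the grid size are assumptions made for this example.

```python
import math


def alpha_beta_of_R(R, gamma):
    """alpha * beta as an explicit function of R, from eq. (2.1.2)."""
    return math.pi * math.sqrt(1.0 - (gamma * R) ** 2) * math.atanh(R) / gamma


def eps_g(R, gamma):
    """Generalization error of eq. (2.3)."""
    return math.acos(gamma * R) / math.pi


def learning_curve(gamma, n=2000):
    """Points (alpha * beta, eps_g) ordered by increasing R on a uniform grid."""
    return [(alpha_beta_of_R(i / n, gamma), eps_g(i / n, gamma))
            for i in range(1, n)]


def count_turning_points(curve):
    """Number of times alpha * beta reverses direction along the curve."""
    turns = 0
    previous = None
    for (x0, _), (x1, _) in zip(curve, curve[1:]):
        increasing = x1 > x0
        if previous is not None and increasing != previous:
            turns += 1
        previous = increasing
    return turns
```

Each value of $R$ gives exactly one point on the curve, while a curve with two turning points (the two spinodal points) is many-valued when read as a function of $\alpha\beta$; for $\gamma = 0.99$ this sketch finds two turning points and for $\gamma = 0.9$ none.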

Once a stationary point has been found its second order properties have to be checked by computing the determinant of the Hessian matrix, $H$. Assume that the correct sign of $\det H$, at $R > 0$, is given by the sign at $R = 0$. As $R$ is increased, the sign of $\det H$ will change first at $R_{sp2}$ and then again at $R_{sp1}$. Note that $R_{sp2} < R_{sp1}$ whereas $\alpha_{sp1} < \alpha_{sp2}$. In the regime $R_{sp2} < R < R_{sp1}$, $\beta f$ has a stationary point but it has the incorrect curvature.




Figure 11: The dashed learning curves are computed assuming RS for the perceptron at $T$ = 0, 0.265, 0.330 and 0.365 reading from top to bottom. The solid learning curve is the one step RSB solution. At the intersection between solid and dashed the entropy for the RS solution is zero. $R$ is increased as a RS solution is followed from the top left corner. At the dots $\det H_{RS}$ changes sign. $\det H_{RSB}$ has the same sign everywhere on the RSB line and is zero at its right end.



Even though the RSB-case is solved by using the RS-equations the determinant of the Hessian matrix of $\beta f_{RSB}$, $\det H_{RSB}$, has to be used to determine the second order properties. $\det H_{RSB}$ consists of the second derivatives of $\beta f_{RSB}$ with respect to $R$, $q$, $\hat R$, $\hat q$ and $m$ whereas $\det H_{RS}$ is computed from the second derivatives of $\beta f_{RS}$ with respect to $R$, $q$, $\hat R$ and $\hat q$. Using $\det H_{RS} = 0$ as the criterion to determine the spinodal point (at $\gamma = 1$) would result in the values of $\alpha_{sp}$ as given in [1] and [2]. Moreover, insisting on this RS-criterion will, for some $\gamma$, result in regions of $\alpha$ where no solution exists. Thus this procedure fails in a disastrous way. However the correct condition, $\det H_{RSB} = 0$, will cure this problem and give $\alpha_{sp}$ as given in this paper.

In the RSB-case $\det H_{RS}$ will have the wrong sign close to $\alpha_{sp}$. This will correspond to points on $\epsilon_g$ not considered in the RS-case since there exists another solution (with $s > 0$) at the same $\alpha$ but with $\det H_{RS}$ having the correct sign. This is illustrated (for $\gamma = 1$) in figure (11).

When the RS-equations are used to solve the RSB-case it is not possible to find the value of $m$ (only $T_{RS} = T_{RSB}/m$). Since $\det H_{RSB}$ depends on $m$ it can not be computed. However it is possible to show that,

$$\det H_{RSB}(R,\hat R,q_0,\hat q_0,m) = \frac{1}{m}F(R,\hat R,q_0,\hat q_0), \qquad (A.9)$$

and since $0 < m < 1$, $\det H_{RSB}$ has the same sign as $F$.



(1 - Rf 1 ' 2 = exp (-aA 2 (l - R)) , (B.12) 



where $A_2$ depends only on $\beta$ and $\gamma$ and where $\sim$ means proportional to in the asymptotic limit of large $\alpha$. In order to solve (B.12) the ansatz,



B. Asymptotics 

For large $\alpha$ the saddle point equations (A.2) imply that $q$, $R$ are close to 1 and $\hat q$, $\hat R$ are large. From this the asymptotic form of the free energy (2.2.1) for non-zero temperatures can be found,

$$\beta f_{RS} = \mathop{\mathrm{extr}}_{R,\hat R,q,\hat q}\left[G_r(R,q,\alpha,\gamma,\beta) + G_s(R,\hat R,q,\hat q)\right], \qquad (B.1)$$



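$$G_s = \frac{1}{2}(1-q)\hat q + R\hat R - \sqrt{\frac{2\hat q}{\pi}}\exp\left(-\frac{\hat R^2}{2\hat q}\right) - \hat R\left[1 - 2H\left(\frac{\hat R}{\sqrt{\hat q}}\right)\right], \qquad (B.2)$$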



Gr = ^ arctan 



Vq ~ 7 2 R 2 
jR 



— — Vl 



H(x) 



dy _i2 
= e y 



/2tt 



o 



rfwln [1 + (e 13 - l)H(w)] 



(B.3) 
(B.4) 
(B.5) 



If $\gamma < 1$ (B.1) generates the saddle point equations,

?2\ 



1 - R 



l-q = d--= 



2 /J_ ex _R^_ 



2 1 



g 

R 



^V 1 - T 2 



/2^yr^ 



a/3y 



ir\/l — 7 2 

The first two of these can be combined into, 
l-R q 



1 - g R 
Using (B.6) and (B.8)-(B.10) results in, 

$$1 - q \sim (1-R)^2,$$



(B.6) 

(B.7) 
, (B.8) 
(B.9) 

(B.10) 
(B.ll) 



$$1 - R(\alpha) = A_0\frac{\ln\alpha}{\alpha} + \delta(\alpha), \qquad (B.13)$$

is made. For consistency it is important to check that $\delta(\alpha)$ is of lower order than $\ln(\alpha)/\alpha$. Combining (B.12) with the ansatz (B.13) and by choosing $A_0 = 1/A_2$ gives the solution,



t opt 



1-R~±2L, (B.14) 

a 



Also 8(a) ~ ln[/n(j4o In a) / 'A^ 3 ]/ 'a is found and thus 
$\delta(\alpha)$ is of lower order. The asymptotic form of $1 - q$ is now easily found using (B.14) and (B.11).

In the RSB-case the temperature is given by the zero entropy condition and can not be regarded as an arbitrary constant. Thus $\beta$ is a function of $\alpha$ and combining the saddle point equations (B.6)-(B.10) with the asymptotic form of the zero entropy condition, $\hat q \sim (1-q)^{-1/2}$, gives,



1 



g 

- R 

o5/2 



—= exp (—B 2 a/3) 
Jot 



(B.15) 
(B.16) 

(B.17) 

(B.18) 



where $B_2$ only depends on $\gamma$. Again an ansatz, $\beta(\alpha) = B_0\ln(\alpha)/\alpha + \delta(\alpha)$, is made which together with $B_0 = 2/B_2$ gives the solution,



ln ( 



(3(a) . (B.19) 

a 



The asymptotic form of $R$, $q$ and $\epsilon_g$ is now found from (B.15) and (B.16) giving,



l-R' 



In a 

a 



(B.20) 



For the tree committee machine the asymptotic forms of $R$, $q$ and $\epsilon_g$ can be found by the same procedure but using the asymptotic form of the free energy (3.2.1).